Topic: Validity

rnelson Member
posted 07-27-2006 10:36 PM
This is from another thread, but I thought I'd start a new topic.

quote: When Don Krapohl talks about "validated" techniques, he's talking about a technique for which there exists a body of literature (usually, at least three independent, peer-reviewed studies) demonstrating how the technique performs in real use. The ZCT with SKY, to my knowledge, doesn't meet that burden. Keep in mind that a "validated" test doesn't mean it's any good, or the best; it just means what I said above. (You could have three techniques: one validated at 36% accuracy (e.g., voice stress), a ZCT at 55% (e.g., some screening tests), and a single-issue ZCT at 90% (e.g., the Utah or Federal). All are valid, but only the latter works at a rate I'm satisfied with.) Granted, those are different types of tests, but the same can be said of single-issue tests and scoring systems.
Validated is not validity. I don't disagree with what you are saying, Barry, or Don, but I do think there is a lot of room for further instruction or discussion here. I've heard or seen this answer before, and I always have the impression that the discussion is incomplete regarding the stipulated meanings of empirical terms such as validated, validity, and reliability.

While validated refers to the existence and quality of empirical research to support an assumption or test method, validity generally refers to construct validity, which in all testing sciences really means "how well does the test measure what we say it measures" (i.e., lies and truth, cholesterol, spelling skill, etc.). This depends in part upon our operational and epistemological definitions of the constructs of deception and truth. Because deception and non-deception are amorphous and intangible constructs (i.e., they are not things, or physical substances, themselves), we are forced to infer their presence or absence through statistical correlates with measurable physiological reaction phenomena. Because all phenomena associated with human physiology have multiple functions, there is no single physiological reaction phenomenon that is uniquely correlated with deception or any other human function. That is what is fundamentally wrong with voice stress (aside from the inadequate construct validity of micro-tremor reactions): the voice-stress method depends upon a single physiological reaction feature.

Single features are diagnostic only when they are either abnormal (i.e., not normally present) or when they are statistically normed. For example, the simple presence of the gonococcus bacterium is diagnostic (you've got gonorrhea). On the other hand, phenomena such as cholesterol, blood pressure changes, electrodermal activity, and breathing are normally occurring. For normally occurring phenomena, it is the degree of prevalence that matters (as in cholesterol). For example: how much cholesterol is too much? That is sometimes as much a matter of policy as it is a matter of science. When it becomes easier (or profitable) to reduce cholesterol, we see decision thresholds lowered. In another example: fever could signify a number of different ailments, but coupled with pain in the lower-right abdomen and an elevated white-cell count, it would generally lead most doctors to offer a diagnostic opinion that a patient is suffering from acute appendicitis. For commonly occurring phenomena, as in this example, it is the aggregated correlational efficiency of multiple symptoms that becomes diagnostic.

Back to validity. A test, method, or principle can be validated (multiple published and corroborating studies) but lack validity, which is really construct validity, or how well we think the polygraph measures what we say it measures (i.e., lies and truth, in the case of voice stress). Daubert specifies some important requirements for validity, including known estimates of both validity and reliability. As pointed out many times, tests, methods, or principles can be reliable yet lack construct validity, but cannot offer usable construct validity without sufficient reliability. Reliability refers to the reproducibility of the results, and includes both interrater reliability (sometimes described through Kappa statistics) and test-retest reliability (sometimes expressed through a coefficient alpha statistic). Some testing paradigms do not assume any test-retest reliability.
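Returning for a moment to the appendicitis example: here is a minimal sketch (Python, with made-up clinical numbers; the fever/pain/white-count rule below is only an illustration of aggregation, not a clinical criterion) of why a single normally occurring feature is weakly diagnostic while converging indicators carry much more weight:

```python
def flag_single(temp_f, threshold=100.4):
    """Single normed feature: a fever alone says very little by itself."""
    return temp_f >= threshold

def flag_aggregate(temp_f, rlq_pain, wbc_per_ul):
    """Aggregate rule: several converging indicators must point the same way."""
    return temp_f >= 100.4 and rlq_pain and wbc_per_ul > 11_000

print(flag_single(101.0))                     # True  (non-specific by itself)
print(flag_aggregate(101.0, False, 8_000))    # False (fever alone)
print(flag_aggregate(101.5, True, 14_000))    # True  (converging signs)
```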
To return to reliability: psychological tests of Axis I concerns such as depression, for example, assume that severity will wax and wane; in these circumstances interrater reliability becomes the primary concern. Other tests, such as psychological tests for Axis II concerns such as intelligence or mental retardation, assume that the concern will remain stable over long periods of time, so test-retest reliability becomes an important concern along with interrater reliability. I believe Daubert also requires established methods for determining the likelihood of an error, which in polygraph brings us to decision theory and Bayesian conditional probability. This becomes even more complex when dealing with mixed-issue tests (more later). As an aside, I'm bothered by Identifi, which reports the test result as reliability. To me this makes no sense and appears to be an inaccurate use of the term.

Just as there are different types of reliability, there are different types of validity (just don't tell politicians this). A perfect test would never fail to notice what we are seeking, and never return a positive result for any reason other than that which we are seeking: zero false negatives, and zero false positives. These objectives have common names, such as sensitivity and specificity. The sensitivity of a test is modulated by both the construct validity (e.g., what is cholesterol, or what physiological phenomena are correlated with deception) and by the decision threshold (i.e., how much is too much).

Because there is no such thing as a "perfect" test, researchers, test developers, and field practitioners have learned to compromise or bias tests for different purposes, and in fact have learned to exploit different types of test bias, using tests in combination, to produce outcomes that are superior to any that could be achieved with a single test or diagnostic method. This is what Don Krapohl refers to as the successive hurdles model. In medical and psychological testing, it's commonly understood as the strategic use of both screening tests (broad tests given to lots of people in the absence of known or specific concerns) and diagnostic tests (narrowly focused tests conducted in response to specific concerns). This is not to suggest that screening tests lack specificity, or that sensitivity is not a concern in the design of diagnostic tests; only that differing testing objectives take priority in different testing circumstances (i.e., whether there is a known incident or allegation, and whether there is reason to suspect a particular individual's involvement, or whether the identification of every possible serious problem, even in the absence of known incidents or obvious concerns, is the primary objective).

Just as screening and diagnostic tests have differing objectives, their corresponding test results have different empirical meaning. Screening tests are intended to alert us to potential problems, while diagnostic tests are intended to inform us about necessary action. The most desirable form of validity for screening tests is sensitivity, and we know in advance that screening tests are expected to slightly over-predict the presence of problems. Alternatively, the most desirable form of validity for diagnostic tests is specificity, because we do not want to take action (surgery, chemotherapy, prenatal care, antibiotic or antiviral medications, or decisions that affect a person's rights or liberty) unless it is actually necessary.
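A minimal sketch of the successive hurdles idea, with assumed (hypothetical) operating characteristics: a sensitive screening stage followed by a more specific diagnostic stage run only on the screen-positives. Note how most of the screening stage's false positives are cleared at the second hurdle, while the cases it misses are never recovered:

```python
def successive_hurdles(n, base_rate, screen_se, screen_sp, diag_se, diag_sp):
    pos = n * base_rate                 # truly "problem" cases in the pool
    neg = n - pos
    # Stage 1: screening test, biased toward sensitivity
    s_tp = pos * screen_se
    s_fp = neg * (1 - screen_sp)
    s_fn = pos - s_tp                   # missed here and never recovered
    # Stage 2: diagnostic test, run only on the screen-positives
    d_tp = s_tp * diag_se
    d_fp = s_fp * (1 - diag_sp)         # most stage-1 false positives cleared
    return {"final_true_pos": round(d_tp, 1),
            "final_false_pos": round(d_fp, 1),
            "missed_at_screening": round(s_fn, 1)}

print(successive_hurdles(n=1000, base_rate=0.10,
                         screen_se=0.95, screen_sp=0.80,
                         diag_se=0.90, diag_sp=0.95))
# {'final_true_pos': 85.5, 'final_false_pos': 9.0, 'missed_at_screening': 5.0}
```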
Improvements in construct validity will generally improve test specificity (with an expected corresponding reduction in false-positive errors). However, it is not always realistic to expect that improvements in construct validity (which regulates test specificity) will always contribute to improved sensitivity. Test sensitivity can be increased by lowering decision thresholds and by improvements in construct validity (i.e., what is cholesterol and how do we find and measure it). In polygraph, test sensitivity can be improved by broadening the test to a mixed series of questions/concerns. Increasing the sensitivity of a test does not mean the test is less valid. Because sensitivity is a form of validity, increases in sensitivity actually improve the validity (screening sensitivity) of the test, but there is a trade-off or compromise. Improving test sensitivity will not improve test specificity, and can be expected to produce an increase in false-positive errors. As Don Krapohl points out, these can be recovered later; false-negative errors cannot.

For these reasons, I disagree with what I view as an over-simplified statement by Donnie Dutton, attributed to Don Krapohl, that there is a trade-off in PCSOT testing between validity and utility. In my opinion this statement does the polygraph field a disservice, as it implies that screening tests are not "valid," when in fact they are biased for a certain emphasis of validity - screening sensitivity - meaning that truthful people don't pass screening tests by accident or chance. Utility is a vaguely defined, if not wholly undefined, term in testing sciences. I believe the term is idiosyncratic to polygraphy, and may have been coined by someone such as Eric Holden, MA, LPC (correct me if I am mistaken), in a genuine attempt to account for the fact that PCSOT testing, in its early stages, appeared to deviate from the principles of single-issue diagnostic (investigative) polygraph testing. Whether it was Eric Holden or not, what people didn't realize was that they were employing well-recognized principles of screening tests in PCSOT testing. The difficulty in recognizing this may be partially due to the fact that the term "screening" in polygraph was contaminated by under-regulated and under-standardized polygraph screening practices prior to EPPA (1988). That experience does not relieve us of the responsibility to discard vague and idiosyncratic jargon and align ourselves with the vocabulary and principles employed by other related sciences.

To simplify this matter, validity is really a question of what kind of validity we seek: screening sensitivity, diagnostic specificity, or something else. The smartest way to use tests is in combination, as in the successive hurdles model or Marin protocol. As always, tests do not replace the need for field investigation work, and tests do not make decisions - tests only give information, which sometimes informs decisions that are made by thoughtful and well-trained professionals. I will contend that it is unethical for any professional to completely surrender one's professional authority to any test. Professionals are always responsible for their decisions. Because the concept of validity is somewhat complex and driven by situational objectives, we are beginning to see an emerging research base describing different scoring paradigms for different purposes.
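The threshold side of that trade-off is easy to show numerically. A minimal sketch (Python); the two score distributions are hypothetical stand-ins for truthful and deceptive examinees, not estimates from any study or technique:

```python
from statistics import NormalDist

truthful = NormalDist(mu=0.0, sigma=1.0)     # hypothetical truthful scores
deceptive = NormalDist(mu=1.5, sigma=1.0)    # hypothetical deceptive scores

for cutoff in (-0.5, 0.0, 0.5, 1.0):
    sensitivity = 1 - deceptive.cdf(cutoff)  # deceptive correctly flagged
    specificity = truthful.cdf(cutoff)       # truthful correctly cleared
    print(f"cutoff {cutoff:+.1f}: sensitivity {sensitivity:.2f}, "
          f"specificity {specificity:.2f}")
```

Lowering the cutoff raises sensitivity and lowers specificity; there is no setting that improves both at once unless the underlying measurement (the construct validity) improves.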
On that note, Don Krapohl mentioned that he and Barry C recently replicated and published a study on scoring rules for evidentiary tests (which in my experience are generally hoped to be exculpatory), which have different priorities or objectives from investigative tests. These tests differ still from screening tests. In the future, we may continue to see different protocols emerge based upon the testing circumstances and objectives, and the type of validity those situations emphasize as the highest priority. OK, I think I've made my point now. I wish I could be more succinct, but this topic is by nature somewhat abstract and ethereal. Peace,
------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964)
J.B. McCloughan Administrator
posted 07-27-2006 11:44 PM
Ray, Utility is for the most part an economics term: http://en.wikipedia.org/wiki/Utility

I agree with you that the topics of statistics and scientific method are discussed in short order most of the time. For the most part, I believe it is done in this manner due to a lack of interest in the discussion. At the APA seminar in 2003, an examiner stood up during a class and said something similar to, "Just tell me which one works and I'll use it." He didn't want to hear about validity or argue about theories any longer. He simply wanted a consensus on the issues so as to have a rubber stamp. I love science and I love searching for the aforementioned answers, and I can tell from your posts and personal discussion that you do as well. There is an important role for science in polygraph.

From an investigative standpoint, field application is not always scientific. As has been said by many a good detective, sometimes you use a sniper rifle and other times a shotgun is a better choice. It is my belief that the question that needs to be asked and answered from a utility (satisfactory consumption) standpoint is whether the product satisfies the needs of the consumer. In my humble opinion, this boils down to whether the polygraph examination provides something better than that which would be used if it were not available, and does so at a cost that justifies its use. If investigators are 40-60% accurate at discerning truths from lies, any tool used to aid the investigator should boost the investigator's base ability to come to the correct decision. Detection-of-deception methods that fall below this could actually degrade the investigator's ability (e.g., CVSA). If the method only boosts the investigator's ability by 2% and costs $100,000, the expense of the method might deter one from using it. This is one of the, if not the, main questions posed to alternative detection-of-deception methods (MRI, P300, etc.). In the end, decision makers will decide what an acceptable I/O ratio is.

The other definition of utility (adjective form) deals with the ability to be used as a substitute in multiple roles. I think this is quite self-explanatory when it comes to polygraph. The science side still needs to be present and forwarded to provide testing procedures that are defensible in court under Daubert standards. This moves certain methods of polygraph into the evidentiary arena. Currently the definition given by the polygraph community for an evidentiary examination is that of a stipulated examination. But under a truer definition of the word evidentiary, an exam has the potential to be evidence by way of its method and what is reported. If the method meets the criteria for admission then it may be submitted. However, even if polygraph were 99.9% correct, I think that any opinion that might subvert or replace the trier of fact's opinion would be hard pressed to be readily admitted into the court.

This is a good topic of discussion and one that will hopefully be looked at more seriously. However, it is late and I am sure that certain portions of my previous ramblings could suffice for some people's definition of non sequitur babble.
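That cost question lends itself to back-of-the-envelope arithmetic. A minimal sketch (Python); the case count and dollar figures are hypothetical, echoing the 2%/$100,000 example above:

```python
def cost_per_added_correct(cases, base_accuracy, boosted_accuracy, tool_cost):
    """Dollars spent per additional correct decision the tool produces."""
    added_correct = cases * (boosted_accuracy - base_accuracy)
    return tool_cost / added_correct if added_correct > 0 else float("inf")

# A 2% boost over a 50% baseline, 500 cases, $100,000 tool:
print(cost_per_added_correct(500, 0.50, 0.52, 100_000))   # 10000.0 per decision
# A method that drops accuracy below the baseline adds nothing at any price:
print(cost_per_added_correct(500, 0.50, 0.45, 100_000))   # inf
```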
rnelson Member
posted 07-28-2006 12:48 AM
I'm aware of that. The concept comes from the 19th-century British philosopher John Stuart Mill and a guy named Bentham, an odd old geezer who I think had himself stuffed after his death. They defined for us one of the three major ethical paradigms in philosophy: utilitarianism, which defines imperfect decisions as ethically sound if they benefit more people than they frustrate (hence the economic emphasis). Utilitarian ethics underlie our criminal justice system, so we are well-versed in this kind of thinking. It is also the kind of ethical thinking that justified slavery (which was argued to benefit the majority at cost to the minority). So, utilitarian ethics alone cannot be our sole guide, and we also owe much to folks like Immanuel Kant, who articulated a deontological ethical paradigm in which we have certain duties and obligations to all persons as individuals. Clinical work is based largely on deontological foundations, in which the practitioner's obligation is to the individual patient. Of course, this is not the case in sex offender treatment, which is why offense-specific treatment is distinct from other forms of treatment (in addition to its emphasis on rationalist epistemological thought vs. the post-modern and deconstructionist philosophies that most other forms of therapy are built around). In case anyone is wondering, the third major ethical paradigm comes all the way from Aristotle, only he used the term aesthetics. Ethics was not regarded as a distinct philosophical concern in western philosophy until about the 18th or 19th century.

I would still argue that utility is a rather vague side-step of the question of validity when discussing the use of the polygraph in sex offender programs. Our consumers intend to regard and depend upon the test results as accurate. I think we're better off in the long run to think things through in the common language of science and testing.

r
------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964)
Barry C Member
posted 07-28-2006 09:23 AM
Ray, Why not consider some posts on validity, reliability, and research methods? I've argued that we don't get enough science, research, statistics, probability, etc. in basic polygraph school, and I suspect that is why we get people who just want to be told what works (or what to do). They have no way to know if what they are being told works really does work. Many preach tradition as science when science has shown otherwise, e.g., outside issue questions, CQs answered "yes" instead of "no," etc.

Lou Rovner has been writing in the APA magazine on the topic, and he's kept things very basic and easy to understand. (I just hope people are reading it.) I know you've taught on the topic, so perhaps you can start a few "educational" threads here. Keep it simple and relevant, and I think with some Q&A down the road you'll make some headway. A lot of people read this board, but many don't post. I'd start out with the basics of the scientific method and add from there based on questions and input. If I remember correctly, you've done a lot of the work already, so you could cut-and-paste and then edit a little? It's just a thought, and you're one of the few I know whose passion makes a dry topic (to most) interesting. Maybe Lou would chime in here too. I think he teaches grads or undergrads that stuff, and he clearly knows how to make a difficult subject seem rather simple.

I forgot to mention my main point: I don't think many would disagree too much with what you've said; the problem is that the real answers seem to take up too much time and are received with boredom or a "just tell me what to do" mentality. Since you're willing to talk about it, I'm encouraging you to do more, but break it down into bite-sized pieces.
rnelson Member
posted 07-28-2006 11:53 AM
quote: It is my belief that the question that needs to be asked and answered from a utility (satisfactory consumption) standpoint is whether the product satisfies the needs of the consumer. In my humble opinion, this boils down to whether the polygraph examination provides something better than that which would be used if it were not available, and does so at a cost that justifies its use. If investigators are 40-60% accurate at discerning truths from lies, any tool used to aid the investigator should boost the investigator's base ability to come to the correct decision. Detection-of-deception methods that fall below this could actually degrade the investigator's ability (e.g., CVSA).
That is well stated. However, the empirical concept which you describe here is referred to as incremental validity, which means that the use of good tools, even if imperfect, is empirically and ethically sound if they contribute to improved decision accuracy. Decision accuracy refers to decisions by investigators, therapists, risk management professionals, program managers, courts, and supervising officers, not polygraph decisions. In testing science, tests don't make decisions; they give information. Test results are not decisions themselves, but probability statements regarding the veracity or significance (statistical significance) of our questions, assumptions, or hypotheses.

I would still argue that we should outgrow the language of utility, as it is not commonly understood in testing sciences, and it tends to promote a laissez-faire attitude among polygraph examiners regarding questions of validity. We say things like "well, it's a utility test," and turn a negligent nod towards the fact that our consumers want to employ the polygraph test and results when making decisions that affect people's lives, rights, and liberties. There is an impulse for some professionals to desire simple answers, and to surrender or externalize too much professional authority. They say things like "we'll let the polygraph decide." I have no problem with polygraphs ordered by courts prior to sentencing. Judges are generally very intelligent, and understand that they make the final decision. In the vein of incremental validity, polygraph is correctly used as a decision support tool.

Our inability to engage some of these lofty and dicey conversations is in part why some of the well-educated folks at ATSA are not yet wild about the polygraph. Carl Jung aside, for the last 100 years we have favored a largely scientific version of psychology. There is a movement in sex offender treatment around the country to emphasize evidence-based treatment and intervention methods. If we don't keep up, then we will be left behind.

Barry, I met Lou Rovner at APA. He's great. We've been crossing paths for years without knowing it. I think he taught at Cal Lutheran - very close to where I lived as a young child - and he is currently located very near my old stomping grounds as a teenager. Polygraph and psychology aside, we have other similar avocational interests in common. Not many people would recognize names such as Malcolm McNab or Rick Baptist.

r
------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964)
J.B. McCloughan Administrator
posted 07-28-2006 10:51 PM
Ray, You are correct that the part you quoted does fall under incremental validity. As for utility, I found the following web page: http://www.incrisis.org/Articles/ReliabilityValidityUseful.htm

quote: Usefulness - The usefulness of a questionnaire is often referred to as its utility. The utility of a questionnaire is defined as the value or cost of using the questionnaire to identify the attribute, state, quality or event we want to identify. There is more than one way to identify a state, event, attribute or quality. Some methods require less effort or fewer resources than others. The idea is to use surveys and questionnaires that are efficient, have a low risk of harm and are cost effective. A questionnaire with high utility is one where the cost of identifying an attribute or quality is low and the cost of being wrong is not high. Another term for utility is the "usefulness" of an instrument, although "usefulness" does not have the precise definition of "utility" within the field of statistics.
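To put that cost framing in concrete terms, here is a minimal sketch (Python) of expected cost per case; the base rate, operating characteristics, and dollar figures are all assumed for illustration:

```python
def expected_cost_per_case(base_rate, sensitivity, specificity,
                           test_cost, cost_false_pos, cost_false_neg):
    """Administration cost plus the expected cost of the test's errors."""
    p_false_pos = (1 - base_rate) * (1 - specificity)
    p_false_neg = base_rate * (1 - sensitivity)
    return test_cost + p_false_pos * cost_false_pos + p_false_neg * cost_false_neg

# Two hypothetical instruments applied to the same population (10% base rate):
print(expected_cost_per_case(0.10, 0.90, 0.85, 300, 2_000, 20_000))  # 770.0
print(expected_cost_per_case(0.10, 0.60, 0.55, 100, 2_000, 20_000))  # 1710.0
```

Under these assumed numbers the cheaper-to-administer but less accurate instrument carries the higher expected cost, which is one sense in which utility and validity are tied together rather than traded off.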
A Google search of "utility of a test" brings up a number of studies, including this one: http://folk.ntnu.no/stoylen/strainrate/Howto/Clinical.html

My point was, and is, that "utility" is not a term used only by the polygraph profession, and that utility should in some way be attached to incremental validity. It would seem logical that utility would need incremental validity to succeed, but I could easily think of some examples where this is not the case. It is of the utmost importance that we do continue to engage in these types of discussions. I think this would make an excellent forum at an APA seminar.
rnelson Member
posted 07-29-2006 09:24 AM
J.B. McCloughan, That is a really good article. The section on stats is very well done, despite the minor awkwardness in English translation. It's a really good example of the fact that medical testing professionals face the exact same empirical challenges that we face as polygraph professionals. In this cardiac imaging example, Dr. Stoylen reports a negative predictive value of .99 and a positive predictive value of .5. These numbers are basically identical to... The statistical concepts are the same ones we've been hashing through here.

The stuff on ROC analysis is important. We should be emphasizing that more, because it presents decision accuracy as an area under the curve (percentile), which is actually a plot of all possible decision thresholds. The reason this becomes important is that ROCs are more immune to the effects of base rate variation than simple percent-correct stats, meaning that ROC becomes a more robust index of the actual predictive validity (actually a posteriori validity, according to Bayes). He even emphasizes the idea of added value (incremental validity), and the known phenomenon that post-processing in research design improves the reliability (and thus the potential validity) of the data, and that the absence of such post-processing is a weakness when testing models are generalized to field practice. Good find.

So, I'll concede that the idea of test utility is not unique to polygraph. However, I think your findings also emphasize my point that utility does not de-emphasize the validity of a test, and does not represent an excuse to neglect the challenge of coming to terms with the actual mathematical validity of our models. Instead, utility is the emphasis of incremental validity (what your author calls added value in his translated English). In other words, the use of imperfect tests can improve the validity of decisions, when employed by thoughtful, well-trained professionals.

------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964)
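As an appended illustration of the predictive values discussed above, a minimal Python sketch of the Bayes arithmetic. The sensitivity and specificity (.90 each) are assumed, chosen so the 10% base-rate row lands near the PPV of about .5 and NPV of about .99 cited from the cardiac imaging example; the point is how strongly the base rate drives both numbers even when the test itself does not change:

```python
def predictive_values(sensitivity, specificity, base_rate):
    tp = base_rate * sensitivity
    fn = base_rate * (1 - sensitivity)
    fp = (1 - base_rate) * (1 - specificity)
    tn = (1 - base_rate) * specificity
    ppv = tp / (tp + fp)    # P(condition present | positive result)
    npv = tn / (tn + fn)    # P(condition absent  | negative result)
    return ppv, npv

for base_rate in (0.10, 0.30, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, base_rate)
    print(f"base rate {base_rate:.2f}: PPV {ppv:.2f}, NPV {npv:.2f}")
# base rate 0.10: PPV 0.50, NPV 0.99
# base rate 0.30: PPV 0.79, NPV 0.95
# base rate 0.50: PPV 0.90, NPV 0.90
```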